About Seurat and SeuratData

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate types of single-cell data. After this short introduction you can read Seurat offical website to dive a bit deeper.

SeuratData is a mechanism for distributing datasets in the form of Seurat objects using R’s internal package and data management systems. It represents an easy way for users to get access to datasets that are used in the Seurat vignettes.

Install Seurat and SeuratData

(If you have installed them before workshop, you do not need to run this block of code.)

install.packages('Seurat')

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("satijalab/seurat-data", quiet = TRUE)

1. Load Seurat and SeuratData

We will use Seurat V5 today, it was published last year. Seurat V5 has gradually gained popularity due to its faster running speed. However, Seurat V5 has some data structure changes compared with older versions of Seurat (V3 & V4), which may cause some old codes to fail to run. To know more please read this website.

library(Seurat)
options(Seurat.object.assay.version = "v5")
library(SeuratData)

We will use pbmcsca dataset today. This public dataset includes single-cell RNA-seq data of human peripheral blood mononuclear cells (PBMCs) using multiple sequencing platforms. Only data that passed quality control are included in the pbmcsca dataset.

data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)
## Warning: Assay RNA changing from Assay to Assay
table(pbmcsca$Method)
## 
##   10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B   10x Chromium (v3) 
##                3362                3222                3222                3222 
##            CEL-Seq2            Drop-seq             inDrops            Seq-Well 
##                 526                6584                6584                3773 
##          Smart-seq2 
##                 526

We will use two scRNA-seq sequencing results (10x Chromium (v2) & 10x Chromium (v3)) from pbmcsca data set today. They sequenced peripheral blood mononuclear cells from two patients using different versions of the sequencing platform, as is common in practice. The raw count matrix and the information of each gene and each cell are saved in Seurat object ‘pbmc_10x_v2’ and ‘pbmc_10x_v3’ independently. In addition, we combined the two sequencing results without any processing and stored them in the Seurat object ‘pbmc_combo’.

pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]

2. Analysis single-cell RNA-seq data from one experiment

Let’s start with a simple case. We only care about the results from experiment using the 10x Chromium (v3) platform, that is, we only perform data analysis on Seurat object ‘pbmc_10x_v3’.

Let’s first take a look at how many cells and genes passed QC.

In count matrix of a Seurat object, the rows represent genes and the columns represent cells.

A count matrix with 20000 rows and 10000 columns means there are 10000 cells with 20000 genes in this Seurat object

dim(pbmc_10x_v3)
## [1] 33694  3222

3222 cells with 33694 genes pass QC.

2.1 Normalization

We can use Seurat function NormalizeData() to normalize raw counts. By default, Seurat implements a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.

pbmc_10x_v3 <- NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize", 
    scale.factor = 10000)

2.2 Feature Selection

We can use Seurat function FindVariableFeatures() to select highly variable genes which have most of useful information for downstream analysis. In this example, we only select top 3,000 most variable genes to save more computing time of downstream analysis. In practice, you can select more genes (5,000 or more) to preserve more information of your scRNA-seq experiment.

pbmc_10x_v3 <- FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)

2.3 Scaling

The single cell dataset likely contains ‘uninteresting’ sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). As suggested in Buettner et al, NBT, 2015, regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. To mitigate the effect of these signals, Seurat constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for dimensionality reduction and clustering.

We can use Seurat function ScaleData() to get the matrix after scaling.

pbmc_10x_v3.all.genes <- rownames(pbmc_10x_v3)
pbmc_10x_v3 <- ScaleData(pbmc_10x_v3, features = pbmc_10x_v3.all.genes)
## Centering and scaling data matrix

2.4 Principal component analysis (PCA)

We perform PCA on the scaled data. By default, the genes in are used as input, but can be defined using pc.genes. We have typically found that running dimensionality reduction on highly variable genes can improve performance. However, with UMI data – particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.

We run PCA on top 3,000 most variable genes (from section 2.2).

pbmc_10x_v3 <- RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1 
## Positive:  LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA 
##     CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1 
##     CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130 
## Negative:  IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A 
##     TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6 
##     CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2 
## PC_ 2 
## Positive:  NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC 
##     CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4 
##     PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X 
## Negative:  EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0 
##     PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1 
##     CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3 
## PC_ 3 
## Positive:  CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH 
##     FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247 
##     MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A 
## Negative:  CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74 
##     IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B 
##     P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19 
## PC_ 4 
## Positive:  IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93 
##     IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E 
##     SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2 
## Negative:  FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10 
##     MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3 
##     GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC 
## PC_ 5 
## Positive:  GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW 
##     CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM 
##     HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1 
## Negative:  IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP 
##     CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2 
##     NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3

How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue. This is out of the scope of today’s workshop. But we highly recomand you to read this document before analyzing your own scRNA-seq data. In this document, they show you how to use some visuliztion methods to help you select most suitable number of genes for PCA and number of PCs for downstream anlysis.

2.5 2D Visulization

2.5.1 Visulization by using t-distributed stochastic neighbour embedding (t-SNE)

We should run t-SNE algorithm first. It is calculated by using PCs, we use top 30 PCs in this example case.

pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:30)

Then, we can draw t-SNE plot by using Dimplot() function by chosing reduction = ‘tsne’

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE)

We can colour points by using other information by using ‘group.by’ property. For example, we can use sequencing platform.

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Method')

You can see all cells in pbmc_10x_v3 data are sequenced by using 10x Chromium(v3) platform.

Let’s see where do these cells come from?

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Experiment')

All cells in pbmc_10x_v3 data comes from experiment 1 (Patient 1).

You can try other numbers of PCs and see what changes?

pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:50)
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Experiment')

2.5.2 Visulization by using Uniform manifold approximation and project (UMAP)

We also need to run UMAP algorithm first. It is also calculated by using PCs, we use top 30 PCs in this example case.

pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims=1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 22:45:18 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:45:18 Read 3222 rows and found 30 numeric columns
## 22:45:18 Using Annoy for neighbor search, n_neighbors = 30
## 22:45:18 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:45:18 Writing NN index file to temp file /var/folders/7f/9kgcv5gj5tn4dy91_qk_7p180000gn/T//Rtmp9FS2KJ/file133e32b02fe18
## 22:45:18 Searching Annoy index using 1 thread, search_k = 3000
## 22:45:18 Annoy recall = 100%
## 22:45:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 22:45:19 Initializing from normalized Laplacian + noise (using RSpectra)
## 22:45:19 Commencing optimization for 500 epochs, with 133242 positive edges
## 22:45:22 Optimization finished

Then, we can draw UMAP plot by using Dimplot() function by chosing reduction = ‘umap’

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Experiment')

2.6 Clustering

Seurat uses the Louvain algorithm for clustering by default. Louvain algorithm needs a neighbor gragh as input. So, we should run FindNeighbors() function first to get a neighbor graph in the Seurat obejcet. Also, the FindNeighbors() is depend on PCs as input. We all use top 30 PCs in our example case.

pbmc_10x_v3 <- FindNeighbors(pbmc_10x_v3, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN

Then, we can run FindClusters() function to do clustering. ‘Algorithm=1’ means we are now using Lovain algorithm for clustering. You can also select ‘Algorithm=4’ to use Leiden algorithm for clustering, but you have to install Python and some Python packages first. You can also try different resolution for more or less clusters.

pbmc_10x_v3 <- FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3222
## Number of edges: 134293
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds

We can also use UMAP to visualize our clustering result. group.by=‘seurat_clusters’ means we want to use clustering result to colour data points (cells) in the UMAP.

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')

2.7 Challenge 1: How to explain cluster 3 by using biological words?

According to the UMAP, we found that cluster 3 is independent of other cells, indicating that cluster 3 has different characteristics from other cells. We tried to find the biological differences between cluster 3 and other cells and explain them in biological language.

Therefore, first we looked for maker genes that were significantly differentially expressed in Cluster 3 compared with other clusters. We can first use FindAllMarkers() function to find marker genes for each cluster. It will take some time.

pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9

Then, we extract the top 5 marker of cluster 3 with lowers p-value.

cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5, ]

Then, we can search this marker genes in this marker gene database website. We need to remove dash(‘-’) before searching.

Based on the result provided by the database, we believe cluster 3 is high likely the cluster of B cells.

Let’s check our answer (the cell-type information are pre-saved in the pbmc_10x_v3 object, but in practice you don’t know that).

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')

Yes! We are correct!

You can try to annotate other clusters by yourself!

The process of annotating each cluster using marker genes is also known as manual cell type annotation. We will try to do it automatically in Section 3.

2.8 Bonus: visulize marker genes

We can use violin plot to visualize one marker genes across all cell-type.

Firstly, let’s have a look the most significant marker genes of Cluster 4.

cluster4.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==4),]
cluster4.markers[1:5, ]

The most significant marker gene of Cluster 4 is VCAN. Then, we can visulize it by using violin plot.

If we want to visulize marker genes across all cell type, we should use Idents() function to identify we wants to use ‘CellType’ as a-axis. Then, we use VlnPlot() function to draw the violin plot.

Idents(object = pbmc_10x_v3) <- "CellType"
VlnPlot(pbmc_10x_v3, features = 'VCAN')

3. Automatical cell type annotation for pbmc_10x_v2

3.1 Introduction

We now have all cell type annotation of pbmc_10x_v3 object. Imagine we know have a new batch of data: pbmc_10x_v2. How can we automatically annotate cell type for it?

Remember: pbmc_10x_v2 and pbmc_10x_v3 used different sequencing platform and sequence sample different patient.

3.2 Practice: pre-processing and visulization for pbmc_10x_v2

In Section 2, we learned how to use Seurat to do pre-processing (Normalization, Feature Selection, Scaling, PCA) and visualization (t-SNE and UMAP).

Would you please help me do pre-processing and draw a UMAP for pbmc_10x_v2 object?

Answer:

# Normalize it
pbmc_10x_v2 <- NormalizeData(pbmc_10x_v2)

# Feature Selection
pbmc_10x_v2 <- FindVariableFeatures(pbmc_10x_v2, selection.method = "vst", nfeatures = 3000)

# Scale it
pbmc_10x_v2.all.genes <- rownames(pbmc_10x_v2)
pbmc_10x_v2 <- ScaleData(pbmc_10x_v2, features = pbmc_10x_v2.all.genes)
## Centering and scaling data matrix
# Do PCA
pbmc_10x_v2 <- RunPCA(pbmc_10x_v2, features = VariableFeatures(object = pbmc_10x_v2))
## PC_ 1 
## Positive:  CST3, LYZ, S100A9, LST1, FCN1, AIF1, CSTA, MNDA, S100A8, RP11-167N5.5 
##     TYROBP, RP11-1143G9.4, LGALS2, LGALS1, FOS, MS4A6A, CLEC12A, FTL, S100A12, SERPINA1 
##     VCAN, CFD, FCER1G, FGL2, CTSS, SPI1, GRN, PSAP, S100A11, CD68 
## Negative:  LTB, IL32, TRAC, CD3D, CD3E, IL7R, TRBC2, CD3G, TRBC1, CCL5 
##     CD7, LDHB, CD2, CD27, CD247, RPS18, CXCR4, BCL11B, CTSW, CCR7 
##     GZMA, CD69, KLRB1, GZMM, EEF1A1, NKG7, NOSIP, SYNE2, CST7, ITM2A 
## PC_ 2 
## Positive:  CD79A, HLA-DRA, MS4A1, CD79B, HLA-DQB1, CD74, HLA-DPA1, HLA-DRB1, HLA-DPB1, IGHM 
##     IGKC, HLA-DQA1, VPREB3, IGHD, TCL1A, MEF2C, BANK1, CD37, LINC00926, IGLC2 
##     HLA-DMB, HVCN1, JCHAIN, HLA-DMA, TNFRSF13C, IGLC3, CD24, CD22, IRF8, MARCH1 
## Negative:  HCST, NKG7, IL32, GZMA, CCL5, CST7, CTSW, S100A4, KLRB1, KLRD1 
##     TMSB4X, PRF1, CD7, CD247, CD3D, FGFBP2, GZMM, GZMH, GNLY, SRGN 
##     CD2, KLRG1, TRAC, ID2, ANXA1, CD3E, MYO1F, CCL4, LYAR, ACTB 
## PC_ 3 
## Positive:  GZMB, NKG7, GNLY, FGFBP2, PRF1, CST7, KLRF1, GZMA, KLRD1, GZMH 
##     SPON2, CLIC3, MYOM2, FCGR3A, CTSW, HOPX, CCL4, PLAC8, CYBA, C12orf75 
##     CLIC1, RHOC, MATK, PLEK, APOBEC3G, PFN1, ACTG1, EFHD2, ARPC2, CCL5 
## Negative:  IL7R, LEF1, CCR7, S100A8, S100A12, MAL, S100A9, LDHB, VCAN, RP4-594I10.3 
##     TRABD2A, NOSIP, TRAC, AIF1, RP11-167N5.5, RGCC, CD14, FOS, BCL11B, RGS2 
##     CD3E, RP11-1143G9.4, LYZ, TRAT1, FCN1, ACTN1, EPHX2, MNDA, CD27, LDLRAP1 
## PC_ 4 
## Positive:  GNLY, FGFBP2, KLRF1, MYOM2, SPON2, KLRD1, NKG7, FCGR3A, PRF1, CCL4 
##     S100A8, S100A12, GZMH, CST7, CD79B, TYROBP, IGHD, GZMA, TCL1A, S100A9 
##     CD79A, MS4A1, TTC38, EFHD2, CX3CR1, HOPX, CD160, VCAN, ADGRG1, PPBP 
## Negative:  EEF1A1, LILRA4, RPS2, LDHB, RPS18, TPM2, IL7R, SERPINF1, ITM2C, SCT 
##     PPP1R14B, PLD4, TRAC, FCER1A, DNASE1L3, FLT3, CD3E, VIM, RPLP1, MAP1A 
##     LTB, PTPRS, SMPD3, UGCG, PACSIN1, CD3D, NOSIP, ZFAT, CD27, IRF7 
## PC_ 5 
## Positive:  EEF1A1, RPS18, RPS2, RPS24, RPLP1, RPS11, TMSB10, LTB, MT-CO1, SLC25A6 
##     CORO1A, TRAC, YBX1, MT-CO3, CD37, COTL1, IL32, JUNB, IL7R, CD3D 
##     CSTB, ARPC3, S100A10, CD3E, JUN, LDHA, CFP, PPIA, LDHB, TRBC2 
## Negative:  PPBP, PF4, LILRA4, SDPR, PTCRA, SCT, TPM2, CLU, NRGN, HIST1H2AC 
##     PTPRS, MAP3K7CL, TUBB1, DNASE1L3, MAP1A, SERPINF1, GNG11, SMPD3, ZFAT, CLEC4C 
##     ITM2C, LINC00996, DAB2, PACSIN1, GZMB, TNFRSF21, ACRBP, RGS18, CD9, PPP1R14B
# Draw UMAP
pbmc_10x_v2 <- FindNeighbors(pbmc_10x_v2, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
pbmc_10x_v2 <- RunUMAP(pbmc_10x_v2, dims=1:30)
## 22:47:08 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:47:08 Read 3362 rows and found 30 numeric columns
## 22:47:08 Using Annoy for neighbor search, n_neighbors = 30
## 22:47:08 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:47:08 Writing NN index file to temp file /var/folders/7f/9kgcv5gj5tn4dy91_qk_7p180000gn/T//Rtmp9FS2KJ/file133e355834828
## 22:47:08 Searching Annoy index using 1 thread, search_k = 3000
## 22:47:09 Annoy recall = 100%
## 22:47:09 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 22:47:09 Initializing from normalized Laplacian + noise (using RSpectra)
## 22:47:10 Commencing optimization for 500 epochs, with 144594 positive edges
## 22:47:13 Optimization finished
DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'Method')

DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'Experiment')

3.3 Cell-type annotation with Seurat by using pbmc_10x_v3 as a reference

Seurat can learn cell type annotation results from one scRNA-seq data and then provide cell type annotations for another scRNA-seq data. First, we use FindTransferAnchors() function to predict which cells in two datasets are of the same cell type. Remember, we have cell type annotaion for pbmc_10x_vs and we want cell type annotation for pbmc_10x_v2. So, pbmc_10x_v3 is the reference data set and pbmc_10x_v2 is the query data set. Also, we use top 30 PCs for this analysis.

anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2, 
                               dims = 1:30)
## Performing PCA on the provided reference using 3000 features as input.
## Projecting cell embeddings
## Finding neighborhoods
## Finding anchors
##  Found 2992 anchors

Then, we can give cells from pbmc_10x_v2 data set a cell-type annotation by using the annotation from pbmc_10x_v3 data set.

predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType, 
                                 dims = 1:30)
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels

Seurat will provide a table with most likely cell type and the probability of each cell type. We can add the most likely cell type to pbmc_10x_v2 object.

pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id 

We can also use UMAP to visualize the automatic cell type annotation result.

DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'CellType_Prediction')

3.3 Cell-type annotation by using Azimuth (a website tool)

Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.

The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file (retain the information useful for cell type annotation only), we first need to perform some operations on the pbmc_10x_v2 object.

DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
saveRDS(pbmc_10x_v2_simple, 'pbmc_10x_v2.Rds')

An Rds file called pbmc_10x_v2.Rds is saved at you working directory. You can check your working directory by using gerwd() function.

Then you can open Azimuth website: https://azimuth.hubmapconsortium.org/.

How to use Azimuth to do cell type annotation (We also have some slides):

  1. Find ‘References for scRNA-seq Queries’ -> Then find ‘Human - PBMC’ -> click ‘Go to App’

  2. Click ‘Browse’ -> find ‘pbmc_10x_v2.Rds’ at your working directory -> Click ‘Open’

  3. Waiting for the Rds file upload to the website

  4. Click ‘Map cells to reference’

  5. Click ‘Downloads Results’

  6. Find ‘Predicted cell types and scores (TSV)’

  7. Click ‘Downlaod’ to get the cell type annotation result: azimuth_pred.tsv

  8. Copy the tsv file (azimuth_pred.tsv) to your R working directory

The tsv file has the same data structure of Seurat annotation result (predictions). We can read the tsv file, then add the annotation result to the pbmc_10x_v2 object by using AddMetaData() function.

azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)

We can also use UMAP to visualize the cell type annotation result by using Azimuth.

DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'predicted.celltype.l2')

3.4 Disscusion

Here is the cell type annotation results provided by the data provider.

DimPlot(pbmc_10x_v2, reduction = "umap", label = TRUE, group.by = 'CellType')

Which cell type annotation is better among the results from Section 3.2 and the results from section 3.3? Why?

What do you think makes one outcome better?

4. Intergrate two data sets

4.1 Before analyzing data

At the begining of the practical session, we combined the two sequencing results without any processing and stored them in the Seurat object ‘pbmc_combo’. Can we analyzing pbmc_combo like we analyze the scRNA-seq data from only one experiment?

4.2 Challenge 2: Why can’t we analyze pbmc_combo directly?

Try analyzing pbmc_combo like analyzing one scRNA-seq experiment data to see what problems will arise?

# Normalize it
pbmc_combo <- NormalizeData(pbmc_combo)

# Feature Selection
pbmc_combo <- FindVariableFeatures(pbmc_combo,
                                   selection.method = "vst", nfeatures = 3000)

# Scale it
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
## Centering and scaling data matrix
# Do PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
## PC_ 1 
## Positive:  LYZ, FCN1, CST3, S100A9, CLEC7A, SERPINA1, VCAN, AIF1, CSTA, CPVL 
##     SPI1, LST1, S100A8, CTSS, MAFB, FGL2, TYMP, CFD, NCF2, NAMPT 
##     FCER1G, PSAP, CD14, MS4A6A, DUSP6, GRN, CD68, TNFAIP2, TGFBI, TYROBP 
## Negative:  IL32, TRAC, CCL5, TRBC2, LTB, RPSA, IL7R, TRBC1, CTSW, CD247 
##     LDHB, CD7, SPOCK2, CST7, GZMM, CD69, ITM2A, GZMA, SYNE2, RORA 
##     BCL11B, TCF7, NKG7, CD6, C12orf75, KLRD1, CD8A, PRF1, KLRB1, LYAR 
## PC_ 2 
## Positive:  HLA-DQB1, CD79A, MS4A1, HLA-DRA, CD79B, IGHM, HLA-DQA1, CD74, IGKC, IGHD 
##     BANK1, MEF2C, VPREB3, TCL1A, LINC00926, HLA-DRB1, HLA-DPB1, HLA-DPA1, HLA-DMB, TNFRSF13C 
##     IGLC2, HLA-DMA, CD22, JCHAIN, HVCN1, IRF8, RPS17, MARCH1, IGLC3, LTB 
## Negative:  NKG7, CST7, IL32, CTSW, ANXA1, S100A4, ARL4C, GZMH, GZMA, RPS4Y1 
##     SRGN, FGFBP2, CCL5, GZMM, RUNX3, PRF1, CD247, GNLY, GZMB, ITGB2 
##     RORA, TNFAIP3, CD8A, SYTL3, METRNL, MT-CO2, EFHD2, PIK3R1, CD7, FLNA 
## PC_ 3 
## Positive:  NRGN, PF4, SDPR, PPBP, GNG11, MAP3K7CL, TUBB1, HIST1H2AC, SPARC, CLU 
##     PGRMC1, RGS18, TMSB4X, TREML1, GPX1, HIST1H3H, MARCH2, ACRBP, PRKAR2B, CTTN 
##     CMTM5, CA2, TSC22D1, AP003068.23, GMPR, HIST1H2BJ, MTURN, NCOA4, MYL9, CD9 
## Negative:  EEF1A1, RPS12, RPSA, TPT1, TMSB10, RPLP0, CD74, PABPC1, CD79A, HLA-DPA1 
##     HLA-DPB1, HLA-DRB1, HLA-DQA1, MS4A1, KLF2, HLA-DRA, HLA-DQB1, YBX1, ARPC3, CD79B 
##     IGHM, LTB, IGKC, MT-ND3, IGHD, LINC00926, BANK1, RPS17, EZR, HSP90AA1 
## PC_ 4 
## Positive:  MT-CO3, JUND, MT-CO2, MT-ATP6, RPS4Y1, GNAS, MT-CYB, MT-ND4, MTRNR2L8, TSC22D3 
##     MT-ND1, CD83, MT-CO1, NFKBIA, MT-ND5, ODC1, MT-ND3, RBM38, CD79A, FTH1 
##     PPP1R15A, DDX3Y, GRASP, TNFAIP3, TAGLN2, UBALD2, EZR, NR4A2, MBD2, LINC00926 
## Negative:  RPS17, ARL17B, XIST, RP11-167N5.5, TMSB4X, MTRNR2L1, RP11-347P5.1, EHF, TYROBP, CFP 
##     RP11-1143G9.4, MYOM2, TMSB10, RP11-160E2.6, LGALS2, S100A11, KLRG1, PYCARD, S100A6, KLRB1 
##     AIF1, CX3CR1, CH17-373J23.1, S100A4, MNDA, KLRD1, PFN1, TMEM176B, ABI3, TMEM176A 
## PC_ 5 
## Positive:  GZMB, FGFBP2, NKG7, PRF1, GNLY, GZMH, FCGR3A, KLRD1, CST7, SPON2 
##     KLRF1, GZMA, CCL5, EFHD2, CTSW, CCL4, CLIC3, ADGRG1, HLA-DPB1, HOPX 
##     RHOC, C12orf75, PRSS23, HLA-DPA1, CLIC1, MYOM2, S1PR5, TTC38, TRDC, MATK 
## Negative:  IL7R, LDHB, LEF1, LEPROTL1, MAL, LTB, RCAN3, TCF7, NOSIP, TRAC 
##     CAMK4, CCR7, RGCC, TRABD2A, BCL11B, JUNB, RPS12, VIM, TPT1, PABPC1 
##     ZFP36L2, TRAT1, AQP3, NELL2, CD27, SATB1, CD40LG, SOCS3, TNFRSF25, OXNAD1
# Draw UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
## 22:48:06 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:48:06 Read 6584 rows and found 30 numeric columns
## 22:48:06 Using Annoy for neighbor search, n_neighbors = 30
## 22:48:06 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:48:06 Writing NN index file to temp file /var/folders/7f/9kgcv5gj5tn4dy91_qk_7p180000gn/T//Rtmp9FS2KJ/file133e3672af431
## 22:48:06 Searching Annoy index using 1 thread, search_k = 3000
## 22:48:07 Annoy recall = 100%
## 22:48:07 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 22:48:08 Initializing from normalized Laplacian + noise (using RSpectra)
## 22:48:08 Commencing optimization for 500 epochs, with 276286 positive edges
## 22:48:14 Optimization finished
DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Method')

DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Experiment')

DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'CellType')

# Clustering
pbmc_combo <- FindClusters(object = pbmc_combo, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6584
## Number of edges: 281263
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9558
## Number of communities: 15
## Elapsed time: 0 seconds
DimPlot(pbmc_combo, reduction = "umap", label = TRUE)

VlnPlot(pbmc_combo, features = 'CD14')

You can see that differences from sequencing platforms and data sources emerge as the major differences. This interferes us to detect valuable biological differences. We call differences caused by non-biological factors such as sequencing platforms or data sources batch effects. We need to first use some statistical methods to remove batch effects before we conduct downstream analysis to eliminate the interference of non-biological factors.

4.3 Solution: Intergrate multiple scRNA-seq data sets

We can also use Seurat to remove batch effect then integrate multiple data sets.

Firstly, we use FindIntegrationAnchors() function to find cells with similar biological information between two data sets. Then, the difference between cells in two data sets with similar biological information is considered as batch effect.

anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7760 anchors
## Filtering anchors
##  Retained 4972 anchors

Then, we can use IntegrateData() function to remove batch effect and integrate two data sets.

pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)
## Merging dataset 2 into 1
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data

We can then visualize the integrated scRNA-seq data by using UMAP.

We need scaling data first for PCA. Then, Run PCA for UMAP.

# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
## Centering and scaling data matrix
# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
## PC_ 1 
## Positive:  IL32, TRAC, CD3D, CCL5, IL7R, TRBC2, CTSW, LTB, GZMA, TRBC1 
##     NKG7, CST7, CD247, CD7, CD27, KLRG1, KLRB1, KLRD1, LDHB, PRF1 
##     LYAR, ITM2A, CD69, BCL11B, LINC00861, TRGC2, ARL4C, CMC1, HOPX, GZMH 
## Negative:  LYZ, CST3, LST1, S100A9, AIF1, FCN1, CSTA, MNDA, RP11-167N5.5, RP11-1143G9.4 
##     S100A8, LGALS2, CLEC12A, SERPINA1, TYROBP, CFD, SPI1, MS4A6A, FGL2, FTL 
##     CTSS, GRN, VCAN, S100A12, FOS, LGALS1, TYMP, PSAP, CLEC7A, NCF2 
## PC_ 2 
## Positive:  NKG7, GZMA, S100A4, CST7, IL32, CTSW, CCL5, KLRD1, PRF1, SRGN 
##     GZMH, FGFBP2, KLRG1, CD247, KLRB1, CD3D, CD7, MYO1F, ID2, ANXA1 
##     PFN1, TMSB4X, CCL4, ACTB, S100A6, CMC1, TRAC, LYAR, FCGR3A, GNLY 
## Negative:  CD79A, MS4A1, CD79B, HLA-DQB1, IGHM, HLA-DRA, IGKC, HLA-DQA1, IGHD, VPREB3 
##     BANK1, TCL1A, CD74, HLA-DPA1, MEF2C, LINC00926, HLA-DPB1, HLA-DRB1, IGLC2, JCHAIN 
##     HVCN1, CD37, HLA-DMB, IGLC3, TNFRSF13C, CD24, CD22, IRF8, HLA-DMA, FCRLA 
## PC_ 3 
## Positive:  PPBP, PF4, SDPR, TUBB1, NRGN, HIST1H2AC, CLU, CTTN, MAP3K7CL, GNG11 
##     RGS18, ACRBP, CD9, PDZK1IP1, GP9, CLDN5, GMPR, TSC22D1, RP11-185E8.1, F13A1 
##     PTCRA, C2orf88, NCOA4, ACTN1, MMD, TMSB4X, TMEM40, SNCA, C19orf33, PRKAR2B 
## Negative:  EEF1A1, RPS2, RPS18, MT-CO1, TMSB10, RPLP1, CORO1A, CYBA, NKG7, CFL1 
##     PFN1, PRF1, CST7, PLAC8, GZMA, CTSW, ACTG1, KLRD1, FGFBP2, ARPC3 
##     YBX1, CD74, GZMH, CD37, GZMB, HOPX, LSP1, SUB1, IFITM2, CALR 
## PC_ 4 
## Positive:  IL7R, LDHB, RPS18, RPS2, LTB, LEF1, RPLP1, TRAC, NOSIP, CCR7 
##     EEF1A1, CD27, MAL, TRAT1, BCL11B, RCAN3, CD40LG, TRABD2A, VIM, MYC 
##     CD3D, AQP3, NELL2, JUNB, RP11-18H21.1, PASK, TNFRSF25, ZFP36L2, INPP4B, RGCC 
## Negative:  GZMB, FGFBP2, KLRF1, GNLY, FCGR3A, NKG7, MYOM2, PRF1, SPON2, KLRD1 
##     GZMH, CST7, CLIC3, CCL4, RHOC, GZMA, PF4, PLEK, TUBB1, PPBP 
##     ACTB, SDPR, CCL5, CLIC1, PTGDR, HOPX, CTSW, IGFBP7, HIST1H2AC, C12orf75 
## PC_ 5 
## Positive:  S100A12, S100A8, VCAN, S100A9, CD14, RP11-167N5.5, RP11-1143G9.4, CSF3R, RBP7, RGS2 
##     FOS, CYP1B1, LYZ, SGK1, FPR1, FCN1, QPCT, MGST1, CYP27A1, MNDA 
##     ASGR1, KLRF1, CREB5, GNLY, MYOM2, FGFBP2, CYBB, CD93, CEBPD, KLRD1 
## Negative:  CDKN1C, PLD4, ENHO, LILRA4, C1QA, LDHB, HES4, FCER1A, RGS10, FLT3 
##     ITM2C, FAM110A, IL3RA, SERPINF1, IL7R, LTB, ADGRE1, LYPD2, C1QB, RNASET2 
##     TPPP3, CSF1R, RHOC, CLEC10A, MS4A4A, TUBA1B, TCF7L2, SCT, TPM2, TRAC
# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
## 22:49:12 UMAP embedding parameters a = 0.9922 b = 1.112
## 22:49:12 Read 6584 rows and found 30 numeric columns
## 22:49:12 Using Annoy for neighbor search, n_neighbors = 30
## 22:49:12 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 22:49:12 Writing NN index file to temp file /var/folders/7f/9kgcv5gj5tn4dy91_qk_7p180000gn/T//Rtmp9FS2KJ/file133e36074bf3b
## 22:49:12 Searching Annoy index using 1 thread, search_k = 3000
## 22:49:13 Annoy recall = 100%
## 22:49:14 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 22:49:14 Initializing from normalized Laplacian + noise (using RSpectra)
## 22:49:14 Commencing optimization for 500 epochs, with 293084 positive edges
## 22:49:20 Optimization finished
DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Method')

DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'Experiment')

DimPlot(pbmc_combo, reduction = "umap", label = TRUE, group.by = 'CellType')

What are the difference between before and after batch effect removal?